June 6, 2017
(taken from public description of UBC MDS course DSCI 561)
By the end of the course, students are expected to be able to:
(adapted from public description of UBC MDS course DSCI 561)
1/4 of the way into the course. Before this lesson, students will have learned the following:
aov()lm() (including reference-treatment parameterization)By the end of this lesson students are expected to be able to:
slides and code available at: https://github.com/ttimbers/UBC-stat-sample-lesson
US property data from 2015 extracted from the Data USA API using Python https://datausa.io/
| display_name | income | mean_commute_minutes | median_property_value | non_eng_speakers_pct | owner_occupied_housing_units | pop |
|---|---|---|---|---|---|---|
| Texas | 53207 | 24.5090 | 136000 | 0.3503480 | 0.622325 | 26538614 |
| Pennsylvania | 53599 | 25.2695 | 166000 | 0.1064820 | 0.692052 | 12779559 |
| South Carolina | 45483 | 23.0552 | 139900 | 0.0686766 | 0.685914 | 4777576 |
| New Hampshire | 66779 | 25.2973 | 237300 | 0.0786821 | 0.709609 | 1324201 |
| Kansas | 52205 | 18.2779 | 132000 | 0.1130530 | 0.666891 | 2892987 |
| Hawaii | 69515 | 25.5813 | 515300 | 0.2521790 | 0.569030 | 1406299 |
## [1] 52 7
US State property value as a function of income
\(Y = \textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value}\) \(X_{1} = \textrm{income}\)
\(\textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value} = \beta_{0} + \beta_{1}\textrm{income} + \varepsilon\)
| Mathematical formula | model formula in R |
|---|---|
| \(Y = \beta_{0} + \beta_{1}X_{1} + \varepsilon\) | Y ~ X1 |
| Mathematical formula | model formula in R |
|---|---|
| \(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \varepsilon\) | Y ~ X1 + X2 + X3 |
| \(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \varepsilon\) | Y ~ X1 * X2 |
\(\textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value} = \beta_{0} + \beta_{1}\textrm{income} + \varepsilon\)
prop_model <- lm(median_property_value ~ income, data = prop_data)
create linear model object and view output in base R:
prop_model <- lm(median_property_value ~ income, data = prop_data) summary(prop_model)
## ## Call: ## lm(formula = median_property_value ~ income, data = prop_data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -65249 -36542 -6990 8003 219312 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.503e+05 4.393e+04 -3.422 0.00125 ** ## income 6.420e+00 7.999e-01 8.026 1.52e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 58860 on 50 degrees of freedom ## Multiple R-squared: 0.563, Adjusted R-squared: 0.5542 ## F-statistic: 64.41 on 1 and 50 DF, p-value: 1.517e-10
summary():| model output | code |
|---|---|
| parameter/coefficient estimates (e.g., \(\beta_{0}\) & \(\beta_{1}\)) | model_object$coefficients |
| residuals | model_object$residuals |
| predicted/fitted values | model_object$fitted.values |
| p-values for coefficients | summary(model_object)$coefficients[,4] |
| \(\sigma\) estimate | summary(model_object)$sigma |
| \(R^{2}\) | summary(model_object)$r.squared |
Working with model output in base R,
the good, the bad and the ugly…
summary(model_object)$coefficient)broom:
A better way for working with model output in R
broom for working with model output in R:broom for working with model output in R:broom function |
model output |
|---|---|
tidy(model_object) |
model parameters (e.g., coefficients and p-values) |
augment(model_object) |
model data (e.g., residuals and predicted values) |
glance(model_object) |
model quality, complexity and summaries (e.g., \(\sigma\) estimate and \(R^{2}\)) |
tidy()tidy(prop_model)
## term estimate std.error statistic p.value ## 1 (Intercept) -1.503050e+05 4.392739e+04 -3.421670 1.248896e-03 ## 2 income 6.420101e+00 7.999334e-01 8.025795 1.517389e-10
augment()augment(prop_model)
## median_property_value income .fitted .se.fit .resid .hat ## 1 136000 53207 191289.28 8184.217 -55289.283 0.01933481 ## 2 166000 53599 193805.96 8167.204 -27805.963 0.01925451 ## 3 139900 45483 141700.42 10610.207 -1800.420 0.03249626 ## 4 237300 66779 278422.90 13107.757 -41122.898 0.04959552 ## 5 132000 52205 184856.34 8281.684 -52856.341 0.01979808 ## 6 515300 69515 295988.30 14882.799 219311.705 0.06393739 ## 7 162900 47583 155182.63 9624.070 7717.367 0.02673642 ## 8 193500 47169 152524.71 9803.561 40975.289 0.02774300 ## 9 160300 44963 138361.97 10880.682 21938.033 0.03417416 ## 10 283400 59269 230207.94 9201.822 53192.063 0.02444181 ## 11 153800 57181 216802.77 8559.790 -63002.765 0.02115007 ## 12 237300 51243 178680.20 8446.070 58619.796 0.02059184 ## 13 129200 53183 191135.20 8185.648 -61935.200 0.01934157 ## 14 122400 49576 167977.89 8882.875 -45577.895 0.02277680 ## 15 129900 49429 167034.14 8929.926 -37134.140 0.02301873 ## 16 148100 49620 168260.38 8869.046 -20160.379 0.02270594 ## 17 247800 60629 238939.27 9752.014 8860.725 0.02745202 ## 18 159000 47507 154694.71 9656.419 4305.295 0.02691646 ## 19 286900 74551 328319.93 18384.621 -41419.925 0.09756522 ## 20 217500 55176 203930.46 8220.159 13569.538 0.01950501 ## 21 259500 61062 241719.18 9945.789 17780.821 0.02855382 ## 22 167500 50255 172337.14 8682.917 -4837.144 0.02176291 ## 23 124200 49255 165917.04 8987.290 -41717.042 0.02331542 ## 24 123200 43740 130510.18 11550.947 -7310.184 0.03851420 ## 25 144100 45047 138901.26 10836.365 5198.744 0.03389635 ## 26 173800 49331 166404.97 8962.014 7395.030 0.02318445 ## 27 186200 61492 244479.82 10146.265 -58279.822 0.02971653 ## 28 138400 48173 158970.49 9382.549 -20570.493 0.02541133 ## 29 133200 52997 189941.06 8198.252 -56741.062 0.01940118 ## 30 238000 56852 214690.55 8484.221 23309.448 0.02077828 ## 31 142100 45219 140005.51 10746.364 2094.487 0.03333564 ## 32 245000 65015 267097.84 12035.754 -22097.839 0.04181502 ## 33 165800 53357 192252.30 8176.291 -26452.298 0.01929738 ## 34 194800 58840 227453.71 9048.489 -32653.714 0.02363403 ## 35 120500 19350 -26076.09 28861.891 146576.088 0.24045578 ## 36 117900 46879 150662.88 9933.937 -32762.882 0.02848581 ## 37 250000 72515 315248.60 16940.706 -65248.599 0.08284164 ## 38 231500 60509 238168.86 9699.815 -6668.863 0.02715893 ## 39 173700 51847 182557.94 8334.941 -8857.945 0.02005353 ## 40 111400 41371 115300.96 12961.218 -3900.964 0.04849281 ## 41 173800 57574 219325.87 8659.682 -45525.865 0.02164660 ## 42 385500 61818 246572.78 10303.309 138927.225 0.03064355 ## 43 475800 70848 304546.29 15785.281 171253.710 0.07192673 ## 44 270500 70331 301227.10 15432.775 -30727.098 0.06875016 ## 45 154900 46868 150592.26 9938.956 4307.740 0.02851459 ## 46 333100 68563 289876.36 14252.125 43223.641 0.05863338 ## 47 103800 41751 117740.60 12726.528 -13940.602 0.04675258 ## 48 215900 60727 239568.44 9795.134 -23668.445 0.02769532 ## 49 125500 43623 129759.03 11617.360 -4259.032 0.03895835 ## 50 140500 50957 176844.05 8507.762 -36344.055 0.02089375 ## 51 103100 39665 104348.27 14047.630 -1248.271 0.05696286 ## 52 315900 72093 312539.32 16645.694 3360.684 0.07998149 ## .sigma .cooksd .std.resid ## 1 58918.37 8.870254e-03 -0.94857880 ## 2 59320.33 2.233837e-03 -0.47703760 ## 3 59455.21 1.624170e-05 -0.03109857 ## 4 59149.62 1.340135e-02 -0.71667509 ## 5 58964.59 8.308867e-03 -0.90705194 ## 6 49863.41 5.065524e-01 3.85125431 ## 7 59445.28 2.426254e-04 0.13290666 ## 8 59158.67 7.111988e-03 0.70603190 ## 9 59370.20 2.544786e-03 0.37926355 ## 10 58955.92 1.048761e-02 0.91498312 ## 11 58755.71 1.264604e-02 -1.08191811 ## 12 58850.56 1.064662e-02 1.00636443 ## 13 58780.62 1.113492e-02 -1.06260413 ## 14 59089.83 7.151043e-03 -0.78333984 ## 15 59213.05 4.799670e-03 -0.63829742 ## 16 59384.37 1.394576e-03 -0.34648059 ## 17 59441.93 3.288868e-04 0.15265344 ## 18 59452.52 7.604647e-05 0.07415162 ## 19 59128.61 2.966454e-02 -0.74078851 ## 20 59423.55 5.391892e-04 0.23282799 ## 21 59399.90 1.380658e-03 0.30650339 ## 22 59451.68 7.680021e-05 -0.08309211 ## 23 59149.19 6.139276e-03 -0.71718167 ## 24 59446.25 3.213271e-04 -0.12666297 ## 25 59450.98 1.416635e-04 0.08986269 ## 26 59446.18 1.917818e-04 0.12712369 ## 27 58851.94 1.547366e-02 -1.00522143 ## 28 59381.22 1.633916e-03 -0.35401923 ## 29 58889.60 9.375535e-03 -0.97351940 ## 30 59360.48 1.699289e-03 0.40020665 ## 31 59455.01 2.258750e-05 0.03619367 ## 32 59368.26 3.209881e-03 -0.38354645 ## 33 59333.21 2.026311e-03 -0.45382412 ## 34 59268.06 3.815341e-03 -0.56146036 ## 35 54384.95 1.292442e+00 2.85745521 ## 36 59265.86 4.675737e-03 -0.56474235 ## 37 58653.71 6.051431e-02 -1.15755561 ## 38 59447.94 1.841993e-04 -0.11487451 ## 39 59442.04 2.364872e-04 -0.15202837 ## 40 59453.04 1.176394e-04 -0.06794519 ## 41 59091.09 6.765037e-03 -0.78199354 ## 42 55934.29 9.084530e-02 2.39738865 ## 43 53759.43 3.534772e-01 3.02024164 ## 44 59281.53 1.080289e-02 -0.54098008 ## 45 59452.51 8.091875e-05 0.07425473 ## 46 59114.19 1.784127e-02 0.75689352 ## 47 59420.79 1.443155e-03 -0.24258925 ## 48 59356.82 2.368628e-03 -0.40781329 ## 49 59452.55 1.104317e-04 -0.07381296 ## 50 59223.80 4.155077e-03 -0.62403838 ## 51 59455.50 1.440480e-05 -0.02183922 ## 52 59453.68 1.540307e-04 0.05952812
glance()glance(prop_model)
## r.squared adj.r.squared sigma statistic p.value df logLik ## 1 0.5629882 0.554248 58858.23 64.41339 1.517389e-10 2 -643.8752 ## AIC BIC deviance df.residual ## 1 1293.75 1299.604 173214534971 50
model diagnostics
dealing with nonlinear terms